System Design: Spotify Top-K Songs Tracker
Problem Statement
Design a system that tracks and retrieves the top-K most played songs on Spotify in real-time, handling millions of song plays per second with minimal memory footprint.
Requirements
Functional Requirements
- Track song play counts in real-time
- Retrieve top-K most played songs (e.g., top 100)
- Handle massive data streams (millions of plays/second)
- Memory-efficient (sublinear in number of songs)
- Near real-time accuracy (small error acceptable)
Non-Functional Requirements
- High Throughput: Process millions of events/second
- Low Latency: < 100ms for Top-K queries
- Horizontally Scalable: Add nodes to handle load
- Fault Tolerant: No single point of failure
- Approximate Counts: Trade accuracy for efficiency (±ε error acceptable)
High-Level Architecture
Core Components
1. Event Stream (Apache Kafka)
Purpose: Ingest and buffer millions of play events
Design Decisions:
- Topics: song-plays topic with multiple partitions
- Partitioning Strategy: Hash by songId for even distribution
- Replication Factor: 3 for fault tolerance
- Retention: 7 days for replay capability
Benefits:
- Decouples producers from consumers
- Handles traffic spikes via buffering
- Enables multiple consumers for different analytics
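As a rough illustration of the ingestion side, the sketch below publishes play events keyed by songId so Kafka's default partitioner spreads songs across partitions. It uses the kafka-python client; the broker address, topic name, and serialization format are assumptions, not an actual production configuration.

```python
# Minimal producer sketch (kafka-python); broker address and topic name are
# illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],
    key_serializer=lambda k: k.encode("utf-8"),          # songId as the partition key
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                                          # wait for in-sync replicas
)

def record_play(song_id: str, user_id: str, timestamp_ms: int) -> None:
    """Publish one play event; hashing the key spreads songs across partitions."""
    event = {"songId": song_id, "userId": user_id, "timestamp": timestamp_ms}
    producer.send("song-plays", key=song_id, value=event)
```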
2. Stream Processing Layer (Apache Flink)
Purpose: Process events in real-time and maintain data structures
Processing Logic:
For each event:
1. Parse songId from event
2. Update Count-Min Sketch
3. Get estimated count from CMS
4. Update Top-K heap if needed
5. Periodically flush to storage
Parallelism:
- Multiple Flink workers process different Kafka partitions
- Each worker maintains local CMS + TopK
- Periodic merge of local structures
State Management:
- Checkpointing enabled (every 1 minute)
- State stored in RocksDB for fault tolerance
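The five processing steps above can be summarized as a single handler. This is a plain-Python sketch of the per-event logic, not the actual Flink DataStream API; CountMinSketch and TopKHeap refer to the illustrative classes sketched under components 3 and 4 below.

```python
# Plain-Python sketch of the per-event logic each worker applies; this is not
# the Flink API. CountMinSketch / TopKHeap are the illustrative classes
# sketched in the following sections.
def process_event(event: dict, cms: "CountMinSketch", topk: "TopKHeap") -> None:
    song_id = event["songId"]          # 1. parse songId from the event
    cms.add(song_id)                   # 2. update the Count-Min Sketch
    estimate = cms.estimate(song_id)   # 3. read back the estimated count
    topk.offer(song_id, estimate)      # 4. update the Top-K heap if it qualifies
    # 5. flushing snapshots to Redis/Cassandra happens on a separate timer
```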
3. Count-Min Sketch (CMS)
Purpose: Space-efficient approximate frequency counter
Structure
A depth × width table of integer counters, with one independent hash function per row.
Operations
Add Operation (when song is played):
For each hash function h_i (i = 1 to depth):
index = h_i(songId) % width
table[i][index] += 1
Estimate Operation (get play count):
estimates = []
For each hash function h_i:
index = h_i(songId) % width
estimates.append(table[i][index])
return min(estimates) // Conservative estimate
Why Minimum?
The minimum across rows gives the best estimate because:
- Each counter may have hash collisions (overcount)
- The true count is ≤ all row values
- Taking minimum reduces collision impact
Parameter Selection
Given desired accuracy (ε) and confidence (δ):
- Width = ⌈e / ε⌉ → for ε = 0.01 (1% error), width = 272
- Depth = ⌈ln(1/δ)⌉ → for δ = 0.001 (99.9% confidence), depth = 7
Memory Usage: depth × width × 4 bytes (int)
- Example: 7 × 272 × 4 ≈ 7.6 KB per CMS!
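To tie the formulas and the add/estimate pseudocode together, here is a minimal single-node Python sketch. Sizing follows the formulas above; deriving the row hash functions by salting MD5 is only an illustrative choice, not a production recommendation.

```python
import hashlib
import math

class CountMinSketch:
    """Minimal Count-Min Sketch sized from the formulas above."""

    def __init__(self, epsilon: float = 0.01, delta: float = 0.001):
        self.width = math.ceil(math.e / epsilon)      # 272 for epsilon = 0.01
        self.depth = math.ceil(math.log(1 / delta))   # 7 for delta = 0.001
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, song_id: str, row: int) -> int:
        # Derive per-row hash functions by salting with the row number.
        digest = hashlib.md5(f"{row}:{song_id}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, song_id: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(song_id, row)] += count

    def estimate(self, song_id: str) -> int:
        # Minimum across rows: collisions only overcount, never undercount.
        return min(self.table[row][self._index(song_id, row)]
                   for row in range(self.depth))
```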
4. Top-K Heap
Purpose: Maintain the K most played songs efficiently
Structure
Min-Heap of size K, where each entry holds (songId, estimatedCount) and the root is the entry with the smallest count currently in the Top-K.
Operations
Insert/Update:
When new count estimate arrives:
IF heap.size < K:
heap.insert(song, count)
ELSE IF count > heap.peek().count:
heap.pop() // Remove minimum
heap.insert(song, count)
Query Top-K:
Return all elements in heap, sorted descending
Time: O(K log K)
Why Min-Heap?
- Root always has the smallest count in Top-K
- Easy to check if new song qualifies: compare with root
- O(log K) insert/delete operations
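A compact Python sketch of the same logic, built on the standard-library heapq module. Entries are (count, songId) tuples so the root is always the smallest tracked count.

```python
import heapq

class TopKHeap:
    """Min-heap of at most K (count, songId) pairs; root = smallest count in the Top-K."""

    def __init__(self, k: int = 100):
        self.k = k
        self.heap: list[tuple[int, str]] = []   # (count, songId)

    def offer(self, song_id: str, count: int) -> None:
        # Mirrors the pseudocode above; updating a song that is already in the
        # heap would additionally need an index from songId to its heap entry.
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (count, song_id))
        elif count > self.heap[0][0]:
            heapq.heapreplace(self.heap, (count, song_id))

    def top_k(self) -> list[tuple[str, int]]:
        # O(K log K): return (songId, count) pairs sorted by count, descending.
        return [(song, cnt) for cnt, song in sorted(self.heap, reverse=True)]
```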
Data Flow: Play Event Processing
Step-by-Step Flow
1. Event Ingestion (1-5 ms)
   - Client sends play event to Load Balancer
   - Event published to Kafka topic
   - Kafka acknowledges write
2. Stream Processing (10-20 ms)
   - Flink consumer reads from Kafka partition
   - Extracts songId from event
   - Updates Count-Min Sketch in memory
3. Count Estimation (< 1 ms)
   - Query CMS for updated count
   - Hash songId with all hash functions
   - Return minimum value across rows
4. Top-K Update (< 1 ms)
   - Compare estimate with heap minimum
   - If higher, replace minimum in heap
   - Rebalance heap
5. Persistence (async, periodic)
   - Batch write CMS snapshots to Redis (every 10s)
   - Write Top-K to Cassandra (every 5s)
   - Enables recovery on failure
API Design
1. Record Play Event (Internal)
POST /internal/play
Content-Type: application/json
{
"songId": "550e8400-e29b-41d4-a716-446655440000",
"userId": "user123",
"timestamp": 1698765432000,
"metadata": {
"device": "mobile",
"duration": 180
}
}
Response: 202 Accepted
2. Get Top-K Songs
GET /api/v1/topSongs?k=100&timeWindow=24h
Response: 200 OK
{
"timestamp": 1698765432000,
"window": "24h",
"results": [
{
"rank": 1,
"songId": "abc123",
"title": "Shape of You",
"artist": "Ed Sheeran",
"playCount": 5210320,
"confidence": 0.999
},
{
"rank": 2,
"songId": "def456",
"title": "Blinding Lights",
"artist": "The Weeknd",
"playCount": 4901005,
"confidence": 0.999
}
// ... 98 more
]
}
3. Get Song Count
GET /api/v1/songs/{songId}/playCount
Response: 200 OK
{
"songId": "abc123",
"playCount": 5210320,
"isApproximate": true,
"errorBound": "Β±1%"
}
Distributed System Design
Scaling Strategy
Horizontal Scaling
               ┌───────────────┐
               │ Load Balancer │
               └───────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
  ┌───────────┐  ┌───────────┐  ┌───────────┐
  │   Kafka   │  │   Kafka   │  │   Kafka   │
  │ Partition │  │ Partition │  │ Partition │
  │     0     │  │     1     │  │     2     │
  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
        │              │              │
        ▼              ▼              ▼
  ┌───────────┐  ┌───────────┐  ┌───────────┐
  │   Flink   │  │   Flink   │  │   Flink   │
  │  Worker 1 │  │  Worker 2 │  │  Worker 3 │
  │           │  │           │  │           │
  │   CMS +   │  │   CMS +   │  │   CMS +   │
  │   TopK    │  │   TopK    │  │   TopK    │
  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
        │              │              │
        └──────────────┼──────────────┘
                       ▼
               ┌───────────────┐
               │   Combiner    │
               │    Service    │
               └───────────────┘
CMS Merge Operation
Each worker maintains local CMS. Periodically merge:
Global CMS[i][j] = Worker1.CMS[i][j] + Worker2.CMS[i][j] + Worker3.CMS[i][j]
Properties:
- Commutative: A + B = B + A
- Associative: (A + B) + C = A + (B + C)
- Perfect for distributed aggregation
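A sketch of the merge, assuming the illustrative CountMinSketch class from the earlier section and that all workers use identical dimensions and hash functions:

```python
def merge_sketches(sketches: list["CountMinSketch"]) -> "CountMinSketch":
    """Element-wise sum of worker sketches; valid because CMS counters are additive."""
    assert all(s.width == sketches[0].width and s.depth == sketches[0].depth
               for s in sketches), "sketches must share dimensions and hash functions"
    merged = CountMinSketch()
    merged.width, merged.depth = sketches[0].width, sketches[0].depth
    merged.table = [
        [sum(s.table[row][col] for s in sketches) for col in range(merged.width)]
        for row in range(merged.depth)
    ]
    return merged
```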
Top-K Merge
Each worker reports local Top-K. Combiner:
1. Collect all local Top-K lists
2. Merge into single list
3. Sort by count descending
4. Take top K elements
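A corresponding combiner sketch. Because events are hash-partitioned by songId, each song should appear in at most one local list, so keeping the maximum reported count per song is equivalent to a simple concatenation while also tolerating duplicates:

```python
def merge_topk(local_lists: list[list[tuple[str, int]]], k: int) -> list[tuple[str, int]]:
    """Combiner: keep the highest count reported per song, then take the global Top-K."""
    best: dict[str, int] = {}
    for local in local_lists:
        for song_id, count in local:
            best[song_id] = max(best.get(song_id, 0), count)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]
```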
Optimizations
1. Time-Window Analysis
Challenge: Track Top-K for different time windows (last hour, day, week)
Solution: Multiple CMS instances per time window
┌──────────────────────────┐
│       CMS Manager        │
│                          │
│  ├─ CMS_1h  (last hour)  │
│  ├─ CMS_24h (last day)   │
│  └─ CMS_7d  (last week)  │
└──────────────────────────┘
Sliding Window Implementation:
- Divide time into small buckets (e.g., 5-minute buckets)
- Maintain CMS per bucket
- Sum relevant buckets for query window
- Expire old buckets
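A sketch of the bucketed window, reusing the illustrative CountMinSketch class and assuming events arrive roughly in time order; bucket size and window length are parameters:

```python
from collections import OrderedDict

class SlidingWindowCMS:
    """Sliding-window sketch: one CMS per bucket (default: 5-minute buckets, 1-hour window)."""

    def __init__(self, bucket_seconds: int = 300, window_seconds: int = 3600):
        self.bucket_seconds = bucket_seconds
        self.num_buckets = window_seconds // bucket_seconds
        self.buckets: OrderedDict[int, CountMinSketch] = OrderedDict()

    def _bucket(self, timestamp: int) -> int:
        return timestamp - (timestamp % self.bucket_seconds)

    def add(self, song_id: str, timestamp: int) -> None:
        bucket = self._bucket(timestamp)
        if bucket not in self.buckets:
            self.buckets[bucket] = CountMinSketch()
            # Expire the oldest buckets once they fall out of the window.
            while len(self.buckets) > self.num_buckets:
                self.buckets.popitem(last=False)
        self.buckets[bucket].add(song_id)

    def estimate(self, song_id: str) -> int:
        # Sum estimates over the buckets covering the query window.
        return sum(cms.estimate(song_id) for cms in self.buckets.values())
```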
2. Heavy Hitters Detection
Optimization: Use Space-Saving Algorithm alongside CMS
Maintain small hash map (10K entries) for frequent items:
- Track exact counts for songs that appear frequently
- Fall back to CMS for less popular songs
- Reduces error for the top songs
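A minimal sketch of the Space-Saving counter table; a production version would use the stream-summary structure so eviction is O(1) rather than a linear scan over the map.

```python
class SpaceSaving:
    """Space-Saving algorithm with a fixed budget of counters."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.counters: dict[str, int] = {}

    def add(self, song_id: str) -> None:
        if song_id in self.counters:
            self.counters[song_id] += 1
        elif len(self.counters) < self.capacity:
            self.counters[song_id] = 1
        else:
            # Evict the current minimum and inherit its count + 1,
            # which bounds the overestimate for the new item.
            victim = min(self.counters, key=self.counters.get)
            self.counters[song_id] = self.counters.pop(victim) + 1

    def heavy_hitters(self, k: int) -> list[tuple[str, int]]:
        return sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)[:k]
```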
3. Bloom Filter Pre-filtering
Use Case: Reduce CMS updates for invalid/test plays
Bloom Filter → Check if songId exists
      │
   [YES] → Update CMS
   [NO]  → Reject (invalid song)
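A small illustrative Bloom filter; the bit-array size and number of hash functions are arbitrary assumptions chosen for the sketch, not tuned values.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter used to reject songIds that were never registered."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, song_id: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{song_id}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, song_id: str) -> None:
        for pos in self._positions(song_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, song_id: str) -> bool:
        # False positives are possible, false negatives are not, so a "no"
        # safely rejects an invalid/test play before it ever touches the CMS.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(song_id))
```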
4. Caching Layer
┌─────────────────────────┐
│    CDN / Edge Cache     │
│     (Top-K results)     │
│     TTL: 30 seconds     │
└────────────┬────────────┘
             │
┌────────────┼────────────┐
│       Redis Cache       │
│     (CMS snapshots)     │
│     TTL: 5 minutes      │
└────────────┬────────────┘
             │
┌────────────┼────────────┐
│     Primary Storage     │
│  (Cassandra/DynamoDB)   │
└─────────────────────────┘
Storage Layer
Redis (In-Memory Cache)
Purpose: Fast access to CMS and recent Top-K
Data Structure:
Key: "cms:{timeWindow}"
Value: Serialized CMS array
TTL: 1 hour
Key: "topk:{timeWindow}"
Value: Sorted Set (song β count)
TTL: 5 minutes
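A sketch of the access pattern with redis-py, following the key layout above; the host name, TTLs, and serialization format are illustrative assumptions.

```python
# Redis access sketch (redis-py); host and serialization are placeholders.
import pickle
import redis

r = redis.Redis(host="redis-primary", port=6379)

def flush_snapshot(window: str, cms: "CountMinSketch", topk: "TopKHeap") -> None:
    # CMS snapshot stored as an opaque blob with a 1-hour TTL.
    r.set(f"cms:{window}", pickle.dumps(cms.table), ex=3600)
    # Top-K stored as a sorted set (member = songId, score = count), 5-minute TTL.
    r.zadd(f"topk:{window}", {song: count for song, count in topk.top_k()})
    r.expire(f"topk:{window}", 300)

def read_topk(window: str, k: int = 100) -> list[tuple[bytes, float]]:
    # Highest scores first; serves the GET /api/v1/topSongs path from cache.
    return r.zrevrange(f"topk:{window}", 0, k - 1, withscores=True)
```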
Cassandra (Persistent Storage)
Table Schema:
CREATE TABLE song_stats (
song_id UUID,
time_bucket TIMESTAMP,
play_count BIGINT,
PRIMARY KEY ((time_bucket), play_count, song_id)
) WITH CLUSTERING ORDER BY (play_count DESC);
Benefits:
- Partition by time bucket
- Cluster by play_count (descending)
- Fast Top-K queries
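A sketch of the read path against this schema using the DataStax Python driver; the contact points and keyspace name are assumptions.

```python
# Read-path sketch with cassandra-driver; hosts and keyspace are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-1", "cassandra-2"])
session = cluster.connect("music_analytics")

def top_100_for_bucket(time_bucket):
    # The clustering order (play_count DESC) means LIMIT returns the Top-K for
    # a time-bucket partition without any server-side sorting.
    query = ("SELECT song_id, play_count FROM song_stats "
             "WHERE time_bucket = %s LIMIT 100")
    return list(session.execute(query, (time_bucket,)))
```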
Capacity Estimation
Traffic
- Daily Active Users: 400M
- Avg Songs/User/Day: 20
- Total Plays/Day: 8 billion
- Plays/Second: ~92,000 (peak: 300,000)
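A quick back-of-the-envelope check of these numbers:

```python
# Traffic estimate: 400M DAU x 20 plays/user/day.
daily_active_users = 400_000_000
plays_per_user_per_day = 20

plays_per_day = daily_active_users * plays_per_user_per_day   # 8,000,000,000
plays_per_second = plays_per_day / 86_400                     # ~92,600 on average
print(plays_per_day, round(plays_per_second))
```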
Storage
Count-Min Sketch:
- Depth: 7, Width: 272
- Per CMS: 7.6 KB
- 3 time windows × 3 replicas: ~68 KB total
Top-K Heap (K=1000):
- Per entry: 16 bytes (songId) + 8 bytes (count)
- Total: 24 KB × 3 windows = 72 KB
Total Memory per Worker: < 200 KB (incredibly efficient!)
Network
- Event size: ~200 bytes
- Kafka throughput: 300K events/s × 200 bytes = 60 MB/s
- Distributed across partitions
Fault Tolerance
Strategies
1. Kafka Replication
   - Replication factor: 3
   - Min in-sync replicas: 2
   - Leader election automatic
2. Flink Checkpointing
   - Periodic snapshots (1 min)
   - State stored in distributed FS
   - Exactly-once processing semantics
3. Redis Sentinel/Cluster
   - Master-slave replication
   - Automatic failover
   - No single point of failure
4. Cassandra Multi-DC
   - Data replicated across 3 DCs
   - Quorum reads/writes
   - Self-healing ring architecture
Trade-offs & Considerations
Accuracy vs Memory
- Exact counting: Requires hashmap = O(N) space for N songs
- CMS: O(1/ε × log(1/δ)) space, independent of N
- Trade-off: Accept 1% error for 1000× memory savings
Latency vs Freshness
- Real-time updates: Higher CPU cost
- Batch updates: Lower cost, slightly stale data
- Choice: Micro-batching (100ms windows)
Consistency vs Availability
- Strong consistency: Synchronized global state (slow)
- Eventual consistency: Faster, temporary inconsistencies
- Choice: Eventual consistency acceptable for analytics
Monitoring & Observability
Key Metrics
1. Throughput
   - Events/second processed
   - Kafka lag per partition
2. Latency
   - P50, P95, P99 processing time
   - End-to-end event latency
3. Accuracy
   - Compare CMS estimates vs exact counts (sample)
   - Track error distribution
4. Resource Usage
   - CPU per Flink worker
   - Memory per CMS instance
   - Network bandwidth
Alerting
- Kafka lag > 10K messages
- Processing latency > 100ms (P95)
- Flink worker failures
- Error rate > 0.1%
Summary
This system achieves:
- ✅ Scalability: Processes millions of plays/second.
- ✅ Efficiency: < 200 KB memory per worker.
- ✅ Accuracy: 99%+ accurate with configurable error bounds.
- ✅ Fault Tolerance: No single point of failure.
- ✅ Low Latency: Sub-100ms Top-K queries.
Key Innovation: Count-Min Sketch enables probabilistic counting at massive scale with minimal resources, making real-time analytics feasible.